Efficient Entity Resolution with MFIBlocks

نویسندگان

  • Batya Kenig
  • Avigdor Gal
چکیده

Entity resolution is the process of discovering groups of tuples that correspond to the same real world entity. In order to avoid the prohibitively expensive comparison of all pairs of tuples, blocking algorithms separate the tuples into blocks which are highly likely to contain matching pairs. Tuning is a major challenge in the blocking process. In particular, contemporary blocking algorithms need to provide a blocking key, based on which tuples are assigned to blocks. Due to the tuning complexity, blocking keys are manually designed by domain experts. In this work, we introduce a blocking approach that avoids selecting a blocking key altogether, relieving the user from this difficult task. The approach is based on maximal frequent itemsets selection. This approach also allows early evaluation of block quality based on the overall commonality of its members. A unique feature of the proposed algorithm is the use of prior knowledge of the estimated sizes of duplicate sets in enhancing the blocking accuracy. We report on a thorough empirical analysis, using common benchmarks of both real-world and synthetic datasets which exhibit the efficiency of our approach.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

MFIBlocks: An effective blocking algorithm for entity resolution

Entity resolution is the process of discovering groups of tuples that correspond to the same real-world entity. Blocking algorithms separate tuples into blocks that are likely to contain matching pairs. Tuning is a major challenge in the blocking process and in particular, high expertise is needed in contemporary blocking algorithms to construct a blocking key, based on which tuples are assigne...

متن کامل

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

Introducing An Efficient Set of High Spatial Resolution Images of Urban Areas to Evaluate Building Detection Algorithms

The present work aims to introduce an efficient set of high spatial resolution (HSR) images in order to more fairly evaluate building detection algorithms. The introduced images are chosen from two recent HSR sensors (QuickBird and GeoEye-1) and based on several challenges of urban areas encountered in building detection such as diversity in building density, building dissociation, building sha...

متن کامل

Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for consequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, extraction of the names of the molecules and their properties. Improvement in the performance of such systems may affects the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011